Data Engineers play a critical role in developing products that are designed to help unlock the power of data. In this project we will focus on few keys areas within the space of Data Engineering and Business Intelligence
Before we start, please follow the steps below.
Install Python on you computer - We will provide a handout with steps to download the python to your machine
Please install the required modules needed to run the python program
Go to Terminal for MAC or Command Prompt for Windows and run below commands
Pandas - Python Dataframe pip install pandas
Plotly - Library to represent data graphically pip install plotly
Note: We will share instructions for installation separately
Lets Start Writing the Code....
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
For the purpose of this exercise, we will use World University Rankings dataset available on Kaggle: https://www.kaggle.com/datasets/r1chardson/the-world-university-rankings-2011-2023
Please download the files and store it in your computer.
We are looking at data from 2022. Please use other years' data as needed.
We will use pandas library to read a CSV format and leverage built-in functions to investigate the data.
Once the file are downloaded to your local machine. Provide a Path
Sample : /Users/username/Downloads/NCWiT Summer 2023/University rankings/2022_rankings.csv
# For this exercise, we have stored the files in the directory named "dataset"
# Please change this to the correct path based on where you have downloaded the dataset
filepath = '../dataset/2022_rankings.csv'
df = pd.read_csv(filepath)
print(df)
rank_order rank name scores_overall \
0 10 1 University of Oxford 95.7
1 20 =2 California Institute of Technology 95.0
2 30 =2 Harvard University 95.0
3 40 4 Stanford University 94.9
4 50 =5 University of Cambridge 94.6
... ... ... ... ...
2107 1000446 Reporter Yaşar University NaN
2108 1000447 Reporter Yenepoya University NaN
2109 1000448 Reporter Yogyakarta State University NaN
2110 1000449 Reporter York St John University NaN
2111 1000450 Reporter Ziauddin University NaN
scores_overall_rank scores_teaching scores_teaching_rank \
0 10 91.0 5
1 20 93.6 2
2 30 94.5 1
3 40 92.3 3
4 50 90.9 6
... ... ... ...
2107 1000446 NaN 0
2108 1000447 NaN 0
2109 1000448 NaN 0
2110 1000449 NaN 0
2111 1000450 NaN 0
scores_research scores_research_rank scores_citations ... \
0 99.6 1 98.0 ...
1 96.9 4 97.8 ...
2 98.9 3 99.2 ...
3 96.8 5 99.9 ...
4 99.5 2 96.2 ...
... ... ... ... ...
2107 NaN 0 NaN ...
2108 NaN 0 NaN ...
2109 NaN 0 NaN ...
2110 NaN 0 NaN ...
2111 NaN 0 NaN ...
scores_international_outlook_rank location \
0 26 United Kingdom
1 167 United States
2 209 United States
3 211 United States
4 32 United Kingdom
... ... ...
2107 0 Turkey
2108 0 India
2109 0 Indonesia
2110 0 United Kingdom
2111 0 Pakistan
stats_number_students stats_student_staff_ratio \
0 20,835 10.7
1 2,233 6.3
2 21,574 9.5
3 16,319 7.3
4 19,680 11.1
... ... ...
2107 6,847 13.0
2108 3,104 6.1
2109 24,988 20.3
2110 6,030 18.0
2111 4,018 9.3
stats_pc_intl_students stats_female_male_ratio \
0 42% 47 : 53
1 34% 36 : 64
2 24% 50 : 50
3 23% 46 : 54
4 39% 47 : 53
... ... ...
2107 2% 53 : 47
2108 0% 67 : 33
2109 1% 72 : 28
2110 8% 66 : 34
2111 1% 60 : 40
aliases \
0 University of Oxford
1 California Institute of Technology caltech
2 Harvard University
3 Stanford University
4 University of Cambridge
... ...
2107 Yaşar University
2108 Yenepoya University
2109 Yogyakarta State University
2110 York St John University
2111 Ziauddin University
subjects_offered closed unaccredited
0 Accounting & Finance,General Engineering,Commu... False False
1 Languages, Literature & Linguistics,Economics ... False False
2 Mathematics & Statistics,Civil Engineering,Lan... False False
3 Physics & Astronomy,Computer Science,Politics ... False False
4 Business & Management,General Engineering,Art,... False False
... ... ... ...
2107 Art, Performing Arts & Design,Mechanical & Aer... False False
2108 Medicine & Dentistry,Biological Sciences,Other... False False
2109 Civil Engineering,Physics & Astronomy,Educatio... False False
2110 Biological Sciences,General Engineering,Geogra... False False
2111 Business & Management,Biological Sciences,Medi... False False
[2112 rows x 24 columns]
Now that we have loaded the data, its time for us to explore the dataset and understand the key elements.
df.shape
(2112 (0x840), 24 (0x18))
We see that the data has 2112 rows and 24 columns. Let us print the first 5 rows (along with the columns). This will give us an idea about what the data looks like.
df.head(5)
| rank_order | rank | name | scores_overall | scores_overall_rank | scores_teaching | scores_teaching_rank | scores_research | scores_research_rank | scores_citations | ... | scores_international_outlook_rank | location | stats_number_students | stats_student_staff_ratio | stats_pc_intl_students | stats_female_male_ratio | aliases | subjects_offered | closed | unaccredited | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 1 | University of Oxford | 95.7 | 10 | 91.0 | 5 | 99.6 | 1 | 98.0 | ... | 26 | United Kingdom | 20,835 | 10.7 | 42% | 47 : 53 | University of Oxford | Accounting & Finance,General Engineering,Commu... | False | False |
| 1 | 20 | =2 | California Institute of Technology | 95.0 | 20 | 93.6 | 2 | 96.9 | 4 | 97.8 | ... | 167 | United States | 2,233 | 6.3 | 34% | 36 : 64 | California Institute of Technology caltech | Languages, Literature & Linguistics,Economics ... | False | False |
| 2 | 30 | =2 | Harvard University | 95.0 | 30 | 94.5 | 1 | 98.9 | 3 | 99.2 | ... | 209 | United States | 21,574 | 9.5 | 24% | 50 : 50 | Harvard University | Mathematics & Statistics,Civil Engineering,Lan... | False | False |
| 3 | 40 | 4 | Stanford University | 94.9 | 40 | 92.3 | 3 | 96.8 | 5 | 99.9 | ... | 211 | United States | 16,319 | 7.3 | 23% | 46 : 54 | Stanford University | Physics & Astronomy,Computer Science,Politics ... | False | False |
| 4 | 50 | =5 | University of Cambridge | 94.6 | 50 | 90.9 | 6 | 99.5 | 2 | 96.2 | ... | 32 | United Kingdom | 19,680 | 11.1 | 39% | 47 : 53 | University of Cambridge | Business & Management,General Engineering,Art,... | False | False |
5 rows × 24 columns
To get the number of rows and columns, use the shape function.
rows = df.shape[0]
cols = df.shape[1]
print(rows)
2112
print(cols)
24
Let us also take a look at what columns exist in the dataset.
df.columns
Index(['rank_order', 'rank', 'name', 'scores_overall', 'scores_overall_rank',
'scores_teaching', 'scores_teaching_rank', 'scores_research',
'scores_research_rank', 'scores_citations', 'scores_citations_rank',
'scores_industry_income', 'scores_industry_income_rank',
'scores_international_outlook', 'scores_international_outlook_rank',
'location', 'stats_number_students', 'stats_student_staff_ratio',
'stats_pc_intl_students', 'stats_female_male_ratio', 'aliases',
'subjects_offered', 'closed', 'unaccredited'],
dtype='object')
nan_values=df.isna()
nan_columns=nan_values.any()
nan_columns
rank_order False rank False name False scores_overall True scores_overall_rank False scores_teaching True scores_teaching_rank False scores_research True scores_research_rank False scores_citations True scores_citations_rank False scores_industry_income True scores_industry_income_rank False scores_international_outlook True scores_international_outlook_rank False location False stats_number_students False stats_student_staff_ratio False stats_pc_intl_students False stats_female_male_ratio True aliases False subjects_offered False closed False unaccredited False dtype: bool
Often, you will have a huge dataset but you will be interested in only a subset. It is helpful to understand how you can extract interesting/relevant data to reduce the size of the dataset that you need to work with.
For this exercise, let us focus on these columns only:
And let us only look at the top-200 universities based on the rank.
To get top-N rows sorted by a column (rank_order), we will use this function: DataFrame.sort_values as described here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
In our case, we need to pass the following args:
We will also use "Slicing" in Python to get the first-N values. For example, to get the first 10 elements in an array/list, you will use: arr[:10]
top_200_universities = df.sort_values(by='rank_order', axis=0)[:200]
top_200_universities.shape
(200 (0xc8), 24 (0x18))
As you see above, we now have a dataframe which has only 200 rows.
Now let us extract specific columns. To do this, you will need to create a list of columns that you want to keep. Then, you will use that list to filter the original dataframe to get the columns you need.
For example, if you only need col2 and col6 from a dataframe, you will use this:
df = df[['col2', 'col6']]
cols_to_keep = ['rank_order', 'rank', 'name', 'location', 'subjects_offered']
# Create a copy so that the original dataset remains unchanged
df_2022_rankings = top_200_universities[cols_to_keep].copy()
df_2022_rankings
| rank_order | rank | name | location | subjects_offered | |
|---|---|---|---|---|---|
| 0 | 10 | 1 | University of Oxford | United Kingdom | Accounting & Finance,General Engineering,Commu... |
| 1 | 20 | =2 | California Institute of Technology | United States | Languages, Literature & Linguistics,Economics ... |
| 2 | 30 | =2 | Harvard University | United States | Mathematics & Statistics,Civil Engineering,Lan... |
| 3 | 40 | 4 | Stanford University | United States | Physics & Astronomy,Computer Science,Politics ... |
| 4 | 50 | =5 | University of Cambridge | United Kingdom | Business & Management,General Engineering,Art,... |
| ... | ... | ... | ... | ... | ... |
| 195 | 1960 | 196 | Medical University of Graz | Austria | Medicine & Dentistry,Other Health |
| 196 | 1970 | =197 | University of Erlangen-Nuremberg | Germany | Archaeology,Computer Science,Sport Science,Bio... |
| 197 | 1980 | =197 | University of Geneva | Switzerland | Politics & International Studies (incl Develop... |
| 198 | 1990 | =197 | University of Hohenheim | Germany | Business & Management,Economics & Econometrics... |
| 199 | 2000 | =197 | Sapienza University of Rome | Italy | Politics & International Studies (incl Develop... |
200 rows × 5 columns
As you can see above, you are left with 200 rows and 5 columns that are of interest.
Sometimes the data may not be in a clean form for us to use for furthur analysis. For Example rank column above you can see there is an extra "Equal to" sign. Lets remove that.
You have to access the str attribute per http://pandas.pydata.org/pandas-docs/stable/text.html
df_2022_rankings['rank'] = df_2022_rankings['rank'].str.replace('=', '')
df_2022_rankings
| rank_order | rank | name | location | subjects_offered | |
|---|---|---|---|---|---|
| 0 | 10 | 1 | University of Oxford | United Kingdom | Accounting & Finance,General Engineering,Commu... |
| 1 | 20 | 2 | California Institute of Technology | United States | Languages, Literature & Linguistics,Economics ... |
| 2 | 30 | 2 | Harvard University | United States | Mathematics & Statistics,Civil Engineering,Lan... |
| 3 | 40 | 4 | Stanford University | United States | Physics & Astronomy,Computer Science,Politics ... |
| 4 | 50 | 5 | University of Cambridge | United Kingdom | Business & Management,General Engineering,Art,... |
| ... | ... | ... | ... | ... | ... |
| 195 | 1960 | 196 | Medical University of Graz | Austria | Medicine & Dentistry,Other Health |
| 196 | 1970 | 197 | University of Erlangen-Nuremberg | Germany | Archaeology,Computer Science,Sport Science,Bio... |
| 197 | 1980 | 197 | University of Geneva | Switzerland | Politics & International Studies (incl Develop... |
| 198 | 1990 | 197 | University of Hohenheim | Germany | Business & Management,Economics & Econometrics... |
| 199 | 2000 | 197 | Sapienza University of Rome | Italy | Politics & International Studies (incl Develop... |
200 rows × 5 columns
Now that you have a subset of the data, you can use that to answer questions that help you make decisions. Let us try to answer some questions like:
To check whether "Computer Science" is offered in a university, we will use str.contains function for a column.
This function will return a Boolean Series containing True or False depending on whether the pattern was found or not. See documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
We will also add another column is_computer_science that will specify whether or not Computer Science is offered for that university.
df_2022_rankings['is_computer_science'] = df_2022_rankings['subjects_offered'].str.contains('Computer Science')
df_2022_rankings
| rank_order | rank | name | location | subjects_offered | is_computer_science | |
|---|---|---|---|---|---|---|
| 0 | 10 | 1 | University of Oxford | United Kingdom | Accounting & Finance,General Engineering,Commu... | True |
| 1 | 20 | 2 | California Institute of Technology | United States | Languages, Literature & Linguistics,Economics ... | True |
| 2 | 30 | 2 | Harvard University | United States | Mathematics & Statistics,Civil Engineering,Lan... | True |
| 3 | 40 | 4 | Stanford University | United States | Physics & Astronomy,Computer Science,Politics ... | True |
| 4 | 50 | 5 | University of Cambridge | United Kingdom | Business & Management,General Engineering,Art,... | True |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 1960 | 196 | Medical University of Graz | Austria | Medicine & Dentistry,Other Health | False |
| 196 | 1970 | 197 | University of Erlangen-Nuremberg | Germany | Archaeology,Computer Science,Sport Science,Bio... | True |
| 197 | 1980 | 197 | University of Geneva | Switzerland | Politics & International Studies (incl Develop... | True |
| 198 | 1990 | 197 | University of Hohenheim | Germany | Business & Management,Economics & Econometrics... | False |
| 199 | 2000 | 197 | Sapienza University of Rome | Italy | Politics & International Studies (incl Develop... | True |
200 rows × 6 columns
As you can see above, a new column is_computer_science is created which contains True or False depending on whether the course is offered or not.
We will use the same "Slicing" as before to get the top-10 universities. Before that, we want to also get the universities that DO offer Computer Science. We will use the is_computer_science column from before and get only those rows where the value is True.
cs_universities = df_2022_rankings[df_2022_rankings['is_computer_science']==True]
cs_universities
| rank_order | rank | name | location | subjects_offered | is_computer_science | |
|---|---|---|---|---|---|---|
| 0 | 10 | 1 | University of Oxford | United Kingdom | Accounting & Finance,General Engineering,Commu... | True |
| 1 | 20 | 2 | California Institute of Technology | United States | Languages, Literature & Linguistics,Economics ... | True |
| 2 | 30 | 2 | Harvard University | United States | Mathematics & Statistics,Civil Engineering,Lan... | True |
| 3 | 40 | 4 | Stanford University | United States | Physics & Astronomy,Computer Science,Politics ... | True |
| 4 | 50 | 5 | University of Cambridge | United Kingdom | Business & Management,General Engineering,Art,... | True |
| ... | ... | ... | ... | ... | ... | ... |
| 193 | 1940 | 193 | Queensland University of Technology | Australia | Languages, Literature & Linguistics,Education,... | True |
| 194 | 1950 | 193 | Texas A&M University | United States | Art, Performing Arts & Design,Sociology,Chemis... | True |
| 196 | 1970 | 197 | University of Erlangen-Nuremberg | Germany | Archaeology,Computer Science,Sport Science,Bio... | True |
| 197 | 1980 | 197 | University of Geneva | Switzerland | Politics & International Studies (incl Develop... | True |
| 199 | 2000 | 197 | Sapienza University of Rome | Italy | Politics & International Studies (incl Develop... | True |
193 rows × 6 columns
As you can see, the number of rows is now reduced to 193. Now out of these, we want the top 10 ranked universities.
top_cs_universities = cs_universities[:10]
top_cs_universities[['name', 'rank', 'location']]
| name | rank | location | |
|---|---|---|---|
| 0 | University of Oxford | 1 | United Kingdom |
| 1 | California Institute of Technology | 2 | United States |
| 2 | Harvard University | 2 | United States |
| 3 | Stanford University | 4 | United States |
| 4 | University of Cambridge | 5 | United Kingdom |
| 5 | Massachusetts Institute of Technology | 5 | United States |
| 6 | Princeton University | 7 | United States |
| 7 | University of California, Berkeley | 8 | United States |
| 8 | Yale University | 9 | United States |
| 9 | The University of Chicago | 10 | United States |
We will use the same "Slicing" as before to get the top-10 universities.
We also have the is_computer_science column from before that we will use to get universities offering Computer Science.
We now need to get universities whose location is "United States"
cs_universities_us = cs_universities[cs_universities['location']=='United States']
cs_universities_us.head(5)
| rank_order | rank | name | location | subjects_offered | is_computer_science | |
|---|---|---|---|---|---|---|
| 1 | 20 | 2 | California Institute of Technology | United States | Languages, Literature & Linguistics,Economics ... | True |
| 2 | 30 | 2 | Harvard University | United States | Mathematics & Statistics,Civil Engineering,Lan... | True |
| 3 | 40 | 4 | Stanford University | United States | Physics & Astronomy,Computer Science,Politics ... | True |
| 5 | 60 | 5 | Massachusetts Institute of Technology | United States | Mathematics & Statistics,Languages, Literature... | True |
| 6 | 70 | 7 | Princeton University | United States | Languages, Literature & Linguistics,Biological... | True |
As you can see, now we have a dataframe containing only universities in the US. It is straightforward to extract the top-10 using the slicing method now.
top_cs_universities_us = cs_universities_us[:10]
top_cs_universities_us[['name', 'rank', 'location']]
| name | rank | location | |
|---|---|---|---|
| 1 | California Institute of Technology | 2 | United States |
| 2 | Harvard University | 2 | United States |
| 3 | Stanford University | 4 | United States |
| 5 | Massachusetts Institute of Technology | 5 | United States |
| 6 | Princeton University | 7 | United States |
| 7 | University of California, Berkeley | 8 | United States |
| 8 | Yale University | 9 | United States |
| 9 | The University of Chicago | 10 | United States |
| 10 | Columbia University | 11 | United States |
| 12 | Johns Hopkins University | 13 | United States |
Let us cross-check the number of rows in the top CS universities. This should be equal to 10.
print(top_cs_universities_us.shape[0])
10
There are multiple ways to represent the data graphically. Please go through this link to see multiple options https://plotly.com/python/plotly-express/
Now that we have Top CS Universities in US dataset avaliable readily, we can plot them in a graph
Graph Examples : Histogram, bar, Line, Area etc.
# Histogram Chart Below
fig = px.histogram(top_cs_universities_us, y='rank', x='name',text_auto=True)
fig.show()
fig = px.pie(top_200_universities, values='stats_student_staff_ratio', names='location',
title='Staff to Student Ratio')
fig.show()